Scotty is a ride-sharing business that operating in several big cities in Turkey. The company provides motorcycles ride-sharing service for Turkey’s citizen, and really value the efficiency in traveling through the traffic–the apps even give some reference to Star Trek “beam me up” in their order buttons.
Scotty provided us with real-time transaction dataset. With this dataset, we are going to help them in solving some forecasting and classification problems in order to improve their business processes.
Scotty turns out being a very popular service in Turkey! The demands for Scotty began to overload, at some region and some times, and there was not enough driver at those times and places. Fortunately, we are know that we can use classification model to predict which region and times are risky enough to have this “no drivers” problem.
Create a classification model report that would be evaluated on next 7 days (Sunday, December 3rd 2017 to Saturday, December 9th 2017). Make prediction that should cover the predicted coverage status for each hour and each area: “sufficient” or “insufficient”.
Our target is to get information which hour and area that have “insufficient” label, it mean we want to know which hour and area that really need more driver.
To solve this case we need import some package to help process and predict:
library(GGally)
library(tidyverse)
library(lubridate)
library(scales)
library(rsample)
library(caret)
library(partykit)
# Smote for unbalanced data
library(DMwR)
# KNN modelling
library(class)
#ROCR
library(ROCR)
# Naive Bayes
library(e1071)
# Visualize
library(plotly)
source("script/matrix_result.R")
source("script/metrics.R")
confussionMatrixPlot <- function(table) {
table %>% ggplot(aes(x = Prediction, y = Reference, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), fontface = "bold", color = "white") +
theme_minimal() +
theme(legend.position = "none")
}
Scotty provide transaction detail transaction from October 1st, 2017 to December 2nd, 2017 as train dataset. Based on this train data we can start process to make model and predict.
train <- read_csv("data/data-train.csv")
## Parsed with column specification:
## cols(
## id = col_character(),
## trip_id = col_character(),
## driver_id = col_character(),
## rider_id = col_character(),
## start_time = col_datetime(format = ""),
## src_lat = col_double(),
## src_lon = col_double(),
## src_area = col_character(),
## src_sub_area = col_character(),
## dest_lat = col_double(),
## dest_lon = col_double(),
## dest_area = col_character(),
## dest_sub_area = col_character(),
## distance = col_double(),
## status = col_character(),
## confirmed_time_sec = col_double()
## )
test <- read_csv("data/data-test.csv")
## Parsed with column specification:
## cols(
## src_area = col_character(),
## datetime = col_datetime(format = ""),
## coverage = col_logical()
## )
Lets check structure of train data:
glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <chr> "sxk9", "sxk8", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ src_sub_area <chr> "sxk9s", "sxk8y", "sxk9e", "sxk9s", "sxk9e", "sxk9…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <chr> "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "sxk9", "s…
## $ dest_sub_area <chr> "sxk9u", "sxk9s", "sxk9e", "sxk9e", "sxk9e", "sxk9…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <chr> "confirmed", "confirmed", "nodrivers", "confirmed"…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
The dataset includes information about:
Based on data above, we cant see coverage variable inside train data. but we found train data structure contain detail order from scotty. Some variable have mismatch datatype, we will try to change data type:
train <- train %>%
mutate(src_area = as.factor(src_area),
src_sub_area = as.factor(src_sub_area),
dest_area = as.factor(dest_area),
dest_sub_area = as.factor(dest_sub_area),
status = as.factor(status)
)
lets preview contain of data to understand more train data:
head(train)
## # A tibble: 6 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <fct>
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34 40.9 29.1 sxk8
## 3 59d0… <NA> <NA> 59cd704… 2017-10-01 00:02:34 41.1 29.0 sxk9
## 4 59d0… 59d006… 599dc0df… 59bd62c… 2017-10-01 00:03:29 41.0 29.0 sxk9
## 5 59d0… 59d007… 59a58565… 59c17cd… 2017-10-01 00:04:24 41.1 29.0 sxk9
## 6 59d0… 59d007… 59579d5f… 594afb1… 2017-10-01 00:05:32 41.0 28.9 sxk9
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## # dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## # confirmed_time_sec <dbl>
we found there’s any NA value inside trip_id and rider_id, to make it more clear i will check NA value from all column before we take any action to data wrangling
summary(is.na(train))
## id trip_id driver_id rider_id
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:214631 FALSE:214632 FALSE:229532
## TRUE :14901 TRUE :14900
## start_time src_lat src_lon src_area
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
## src_sub_area dest_lat dest_lon dest_area
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
## dest_sub_area distance status confirmed_time_sec
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:229532 FALSE:229532 FALSE:229532 FALSE:229532
##
We found NA value only trip_id and driver_id, lets compare some row with NA value with non-NA row
train[c(1:3),]
## # A tibble: 3 x 16
## id trip_id driver_id rider_id start_time src_lat src_lon src_area
## <chr> <chr> <chr> <chr> <dttm> <dbl> <dbl> <fct>
## 1 59d0… 59d005… 59a892c5… 59ad2d6… 2017-10-01 00:00:17 41.1 29.0 sxk9
## 2 59d0… 59d006… 59a13556… 59ce930… 2017-10-01 00:02:34 40.9 29.1 sxk8
## 3 59d0… <NA> <NA> 59cd704… 2017-10-01 00:02:34 41.1 29.0 sxk9
## # … with 8 more variables: src_sub_area <fct>, dest_lat <dbl>, dest_lon <dbl>,
## # dest_area <fct>, dest_sub_area <fct>, distance <dbl>, status <fct>,
## # confirmed_time_sec <dbl>
We found there’s any relation why trip_id and driver_id NA, is because status is “nodrivers”. and it will fullfil if status is “confirmed”. It will be our clue to extract information of coverage data.
Based on our target we have before, we will predict 7 days for each hour and each area. Why we dont specified time level only days, we cant explain detail which time get more dirver order. but if we make level into minute the data will too specific and result the data we have will make us confuse. So hourly interval is best time level to predict.
We can see from structure we dont have hour variable. We have start_time as users order time, we can extract specific level from Date, Days, Hour:
train_hourly <- train %>%
mutate(src_area = as.factor(src_area),
date = date(start_time),
day = as.factor(wday(start_time, label = TRUE, abbr = FALSE)),
hour = as.factor(hour(start_time)))
glimpse(train_hourly)
## Observations: 229,532
## Variables: 19
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
## $ date <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 2…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Su…
## $ hour <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
we have new variable date it explain date order happen,day it explain day, and hour it explain hours when user order driver.
summary(train_hourly)
## id trip_id driver_id rider_id
## Length:229532 Length:229532 Length:229532 Length:229532
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## start_time src_lat src_lon src_area
## Min. :2017-10-01 00:00:17 Min. :40.79 Min. :28.48 sxk3: 9946
## 1st Qu.:2017-10-19 19:36:04 1st Qu.:41.00 1st Qu.:28.97 sxk8: 5636
## Median :2017-11-07 14:08:48 Median :41.03 Median :29.00 sxk9:213950
## Mean :2017-11-04 18:32:48 Mean :41.03 Mean :29.00
## 3rd Qu.:2017-11-19 03:44:19 3rd Qu.:41.06 3rd Qu.:29.04
## Max. :2017-12-02 23:59:20 Max. :41.13 Max. :29.18
##
## src_sub_area dest_lat dest_lon dest_area
## sxk9e :40869 Min. : 5.923 Min. : 0.1273 sxk9 :205449
## sxk9s :24732 1st Qu.:41.001 1st Qu.: 28.9689 sxk3 : 12516
## sxk97 :24512 Median :41.036 Median : 29.0036 sxk8 : 4973
## sxk9k :16612 Mean :41.032 Mean : 29.0025 sxkc : 2264
## sxk9j :12336 3rd Qu.:41.064 3rd Qu.: 29.0475 sxkb : 1994
## sxk9h :12205 Max. :52.370 Max. :121.4721 sxkd : 1911
## (Other):98266 (Other): 425
## dest_sub_area distance status confirmed_time_sec
## sxk9e : 35941 Min. : 0.000 confirmed:214631 Min. : 0.00
## sxk9s : 25346 1st Qu.: 2.802 nodrivers: 14901 1st Qu.: 12.00
## sxk97 : 23031 Median : 4.804 Median : 26.00
## sxk9k : 13868 Mean : 6.772 Mean : 46.41
## sxk9j : 13704 3rd Qu.: 7.943 3rd Qu.: 66.00
## sxk9h : 10423 Max. :9345.566 Max. :514.00
## (Other):107219
## date day hour
## Min. :2017-10-01 Sunday :21063 18 : 25403
## 1st Qu.:2017-10-19 Monday :27425 19 : 20572
## Median :2017-11-07 Tuesday :31190 17 : 19156
## Mean :2017-11-04 Wednesday:31999 16 : 16090
## 3rd Qu.:2017-11-19 Thursday :32251 8 : 14938
## Max. :2017-12-02 Friday :50726 15 : 14227
## Saturday :34878 (Other):119146
To make dataset convert perfectly to hourly interval, we need grouping transaction data and summarise data to hourly.
Example we have data with time 06:01:00,06:30:00,07:00:00 so it will group 06:00:00 have 2 transaction and 07:00:00 have 1 transaction.
We will grouping data by src_area, date, hour, and status, for data aggregation. Beside that after we grouping we summarise total_transaction it explain how many transaction within interval, and at status we have 2 class we already know before confirmed and nodrivers, we will count frequency:
coverage_hourly <- train_hourly %>%
group_by(src_area, date, day, hour, status) %>%
summarise(n = n()) %>%
ungroup() %>%
spread(status, n, fill=0) %>%
mutate(
total_transaction = confirmed + nodrivers,
datetime = ymd_h(paste(date, hour)))
#preview the data
head(coverage_hourly)
## # A tibble: 6 x 8
## src_area date day hour confirmed nodrivers total_transacti…
## <fct> <date> <ord> <fct> <dbl> <dbl> <dbl>
## 1 sxk3 2017-10-01 Sund… 0 0 1 1
## 2 sxk3 2017-10-01 Sund… 1 1 1 2
## 3 sxk3 2017-10-01 Sund… 3 0 1 1
## 4 sxk3 2017-10-01 Sund… 4 0 1 1
## 5 sxk3 2017-10-01 Sund… 7 0 1 1
## 6 sxk3 2017-10-01 Sund… 8 1 2 3
## # … with 1 more variable: datetime <dttm>
To make prediction model and good exploratory data analysis, we need complete time interval data. lets take look summary of our interval data:
summary(coverage_hourly$datetime)
## Min. 1st Qu. Median
## "2017-10-01 00:00:00" "2017-10-16 19:00:00" "2017-11-01 17:00:00"
## Mean 3rd Qu. Max.
## "2017-11-01 15:05:47" "2017-11-17 10:00:00" "2017-12-02 23:00:00"
based on summary our data is start from 2017-10-01 00:00:00 to 2017-12-02 23:00:00. We will determine for as start and end of interval
sorted_datetime <- coverage_hourly[order(coverage_hourly$datetime),]
datetime_start <- sorted_datetime$datetime[1]
datetime_end <- sorted_datetime$datetime[length(sorted_datetime$datetime)]
datetime_start
## [1] "2017-10-01 UTC"
datetime_end
## [1] "2017-12-02 23:00:00 UTC"
Make complete list from start interval until end of interval in hourly
all_date <- seq(datetime_start, datetime_end, by = "hour")
all_date <- data.frame(list(datetime = all_date))
# total hours from start to end date
nrow(all_date)
## [1] 1512
Total we have 1512 hour interval from start to the end. we will padding implement new time interval to our train data
sxk3 <- coverage_hourly %>%
filter(src_area == "sxk3") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk3"))
sxk8 <- coverage_hourly %>%
filter(src_area == "sxk8") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk8"))
sxk9 <- coverage_hourly %>%
filter(src_area == "sxk9") %>%
arrange(desc(datetime)) %>%
merge(all_date, all = T) %>%
mutate(src_area = as.factor("sxk9"))
coverage_all_date <- do.call("rbind", list(sxk3, sxk8, sxk9))
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 <NA> <NA> <NA> NA NA
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 <NA> <NA> <NA> NA NA
## total_transaction
## 1 1
## 2 2
## 3 NA
## 4 1
## 5 1
## 6 NA
Implement new interval data cause NA value for some row, it because from our train data dont have any transaction during some interval taime. We will fill this NA with 0 value:
coverage_all_date <- coverage_all_date %>%
mutate(
date = date(datetime),
hour = as.factor(hour(datetime)),
day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
confirmed = replace_na(confirmed, 0),
nodrivers = replace_na(nodrivers,0),
total_transaction = replace_na(total_transaction ,0),
)
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 2017-10-01 Sunday 2 0 0
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 2017-10-01 Sunday 5 0 0
## total_transaction
## 1 1
## 2 2
## 3 0
## 4 1
## 5 1
## 6 0
Our case is to make prediction that should cover the predicted coverage status for each hour and each area: “sufficient” or “insufficient”. Next we will make our target variable coverage based on variable we have on dataset. status is variable we use labeling coverage, status with nodriver > 0, will label as insufficient and if status nodriver = 0, we will label it sufficient.
coverage_all_date <- coverage_all_date %>%
mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))
head(coverage_all_date)
## datetime src_area date day hour confirmed nodrivers
## 1 2017-10-01 00:00:00 sxk3 2017-10-01 Sunday 0 0 1
## 2 2017-10-01 01:00:00 sxk3 2017-10-01 Sunday 1 1 1
## 3 2017-10-01 02:00:00 sxk3 2017-10-01 Sunday 2 0 0
## 4 2017-10-01 03:00:00 sxk3 2017-10-01 Sunday 3 0 1
## 5 2017-10-01 04:00:00 sxk3 2017-10-01 Sunday 4 0 1
## 6 2017-10-01 05:00:00 sxk3 2017-10-01 Sunday 5 0 0
## total_transaction coverage
## 1 1 insufficient
## 2 2 insufficient
## 3 0 sufficient
## 4 1 insufficient
## 5 1 insufficient
## 6 0 sufficient
Until this step, we have label as target variable is coverage with 2 level “sufficient” and “insufficient” in train data.
Check proportion for our target variable coverage, on process we have prepare some set of data frame. coverage_hourly data before we padding it, and after padding data we have coverage_all_date. let check proportion before padding and after padding.
coverage_hourly <- coverage_hourly %>%
mutate(coverage = as.factor(if_else(nodrivers == 0, "sufficient", "insufficient")))
proportion <- data.frame(prop.table(table(coverage_hourly$coverage)))
proportion %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(aes(fill = Freq), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(Freq)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
theme(legend.position = "none") +
labs(
title = "Coverage proportion before padding",
x = "",
y = ""
)
Before padding proportion of coverage is 64% of data label as insufficient and 36% of data label as sufficient. Next we show train data after implement padding or new time interval
proportion <- data.frame(prop.table(table(coverage_all_date$coverage)))
proportion %>%
ggplot(aes(x = Var1, y = Freq)) +
geom_bar(aes(fill = Freq), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(Freq)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
theme(legend.position = "none") +
labs(
title = "Coverage proportion after padding",
x = "",
y = ""
)
Proportion after we padding is more better before we padding, proportion is 54.1& labeled as insufficient and 45.9% labeled as sufficient.
This dataset have src_are explain different area, lets check how is frequency from our train dataset:
proportion <- data.frame(summary(coverage_all_date$src_area)) %>%
rownames_to_column(., "VALUE") %>%
mutate("Source Area" = .[,1],
"Freq" = .[,2]) %>%
select(3,4)
proportion
## Source Area Freq
## 1 sxk3 1512
## 2 sxk8 1512
## 3 sxk9 1512
it show for each area have same frequency data 1512 transaction, its because we has padding the train dataset before. Lets check proportion of coverage for each area:
proprotion <- coverage_all_date %>%
group_by(src_area) %>%
summarise(insufficient = sum(coverage == "insufficient"),
sufficient = sum(coverage == "sufficient"),
total = n(),
"proportion insufficient" = insufficient/n(),
"proportion sufficient" = sufficient/n())
proprotion
## # A tibble: 3 x 6
## src_area insufficient sufficient total `proportion insuffi… `proportion suffi…
## <fct> <int> <int> <int> <dbl> <dbl>
## 1 sxk3 947 565 1512 0.626 0.374
## 2 sxk8 258 1254 1512 0.171 0.829
## 3 sxk9 1249 263 1512 0.826 0.174
More better we visualize it
proprotion %>%
select("src_area","proportion insufficient", "proportion sufficient") %>%
rename("insufficient" = "proportion insufficient",
"sufficient" = "proportion sufficient") %>%
gather(key = "prop",
value = "value",
-src_area) %>%
ggplot(aes(x = prop, y = value)) +
geom_bar(aes(fill = prop), stat = "identity") +
scale_y_continuous(labels = scales::percent) +
geom_text(aes(label = percent(value)),
fontface = "bold",
color = "white",
position = position_stack(vjust = 0.5)) +
facet_wrap( ~ src_area) +
labs(
title = "Coverage proportion group by Area",
x = "",
y = ""
) +
theme(legend.position = "none")
From 3 area we have, only sxk8 that have more than 50% sufficient label, it indicate other area sxk3 and sxk9 need more driver in those area. lets check other visualize to get better insight.
We will try to use heatmap type visualization
coverage_all_date %>%
mutate(day = as.factor(wday(datetime, label = TRUE, abbr = TRUE))) %>%
group_by(day, hour, src_area) %>%
summarise(sufficient = sum(coverage == "sufficient")) %>%
ggplot(aes(x = day, y = hour, fill = sufficient)) +
geom_tile() +
facet_wrap(~src_area) +
theme(legend.position = "bottom")
To make more clear and get detail insight, i would make other visualization using bar plot:
coverage_all_date %>%
select(-c("datetime", "date", "day")) %>%
group_by(hour, src_area) %>%
summarise(insufficient = sum(coverage == "insufficient"),
sufficient = sum(coverage == "sufficient")) %>%
gather("type","value",-src_area, -hour) %>%
ggplot(aes(x = hour, y = value, fill = type)) +
geom_col(position = "dodge") +
facet_grid(src_area ~ .) +
theme(legend.position = "bottom") +
labs(title = "Coverage hourly group by Area")
some insight we can get is:
1. sxk9 area is the most insufficient driver area, from heatmap all days its show dark area it mean all days and every hour users didnt get driver when they order. 2. sxk8 area is the most sufficient driver area, every hour and every hour confirmed driver than other area. 3. However area sxk8 still in 08:00 it nearly insufficient label more than sufficient label, especially in working days from monday to friday, it show more dark blue on heatmap. 4.sxk3 area get more driver start from morning 06:00 which people start to activity until late night around 00:00 and this case happen always every day looks from heatmap. 5. There is corellation between days, hour, area to coverage.
glimpse(train)
## Observations: 229,532
## Variables: 16
## $ id <chr> "59d005e1ffcfa261708ce9cd", "59d0066a3d32b861760d4…
## $ trip_id <chr> "59d005e9cb564761a8fe5d3e", "59d00678ffcfa261708ce…
## $ driver_id <chr> "59a892c5568be44b2734f276", "59a135565e88a24b11f11…
## $ rider_id <chr> "59ad2d6efba75a581666b506", "59ce930f3d32b861760a4…
## $ start_time <dttm> 2017-10-01 00:00:17, 2017-10-01 00:02:34, 2017-10…
## $ src_lat <dbl> 41.07047, 40.94157, 41.07487, 41.04995, 41.05287, …
## $ src_lon <dbl> 29.01945, 29.11484, 28.99528, 29.03107, 28.99522, …
## $ src_area <fct> sxk9, sxk8, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ src_sub_area <fct> sxk9s, sxk8y, sxk9e, sxk9s, sxk9e, sxk90, sxk9e, s…
## $ dest_lat <dbl> 41.11716, 41.06151, 41.08351, 41.04495, 41.08140, …
## $ dest_lon <dbl> 29.03650, 29.02068, 29.00228, 28.98192, 28.98197, …
## $ dest_area <fct> sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sxk9, sx…
## $ dest_sub_area <fct> sxk9u, sxk9s, sxk9e, sxk9e, sxk9e, sxk97, sxk9q, s…
## $ distance <dbl> 5.379250, 15.497130, 1.126098, 4.169492, 3.358296,…
## $ status <fct> confirmed, confirmed, nodrivers, confirmed, confir…
## $ confirmed_time_sec <dbl> 8, 14, 0, 32, 65, 110, 0, 49, 27, 21, 23, 46, 185,…
ggcorr(train, label = T)
## Warning in ggcorr(train, label = T): data in column(s) 'id', 'trip_id',
## 'driver_id', 'rider_id', 'start_time', 'src_area', 'src_sub_area', 'dest_area',
## 'dest_sub_area', 'status' are not numeric and were ignored
summary(coverage_all_date)
## datetime src_area date day
## Min. :2017-10-01 00:00:00 sxk3:1512 Min. :2017-10-01 Sunday :648
## 1st Qu.:2017-10-16 17:45:00 sxk8:1512 1st Qu.:2017-10-16 Monday :648
## Median :2017-11-01 11:30:00 sxk9:1512 Median :2017-11-01 Tuesday :648
## Mean :2017-11-01 11:30:00 Mean :2017-11-01 Wednesday:648
## 3rd Qu.:2017-11-17 05:15:00 3rd Qu.:2017-11-17 Thursday :648
## Max. :2017-12-02 23:00:00 Max. :2017-12-02 Friday :648
## Saturday :648
## hour confirmed nodrivers total_transaction
## 0 : 189 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1 : 189 1st Qu.: 1.00 1st Qu.: 0.000 1st Qu.: 2.00
## 2 : 189 Median : 6.00 Median : 1.000 Median : 7.00
## 3 : 189 Mean : 47.32 Mean : 3.285 Mean : 50.60
## 4 : 189 3rd Qu.: 42.00 3rd Qu.: 3.000 3rd Qu.: 49.25
## 5 : 189 Max. :874.00 Max. :392.000 Max. :1240.00
## (Other):3402
## coverage
## insufficient:2454
## sufficient :2082
##
##
##
##
##
Meanwhile after several processing data we has make new data frame coverage_hourly with structure:
glimpse(coverage_hourly)
## Observations: 3,862
## Variables: 9
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk…
## $ date <date> 2017-10-01, 2017-10-01, 2017-10-01, 2017-10-01, 20…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sun…
## $ hour <fct> 0, 1, 3, 4, 7, 8, 11, 12, 13, 14, 15, 16, 17, 18, 1…
## $ confirmed <dbl> 0, 1, 0, 0, 0, 1, 2, 0, 3, 7, 4, 2, 0, 5, 0, 0, 0, …
## $ nodrivers <dbl> 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 0, 1, 1, 0, 1, 3, 1, …
## $ total_transaction <dbl> 1, 2, 1, 1, 1, 3, 3, 1, 4, 8, 4, 3, 1, 5, 1, 3, 1, …
## $ datetime <dttm> 2017-10-01 00:00:00, 2017-10-01 01:00:00, 2017-10-…
## $ coverage <fct> insufficient, insufficient, insufficient, insuffici…
some variable in our train dataset still didint use in this new data frame, we can put into our new dataframe
int_var <- train_hourly %>%
group_by(src_area, date, day, hour) %>%
summarise(mean_distance = mean(distance),
sum_distance = sum(distance),
median_distance = median(distance),
mean_confirmed_time_sec = mean(confirmed_time_sec),
sum_confirmed_time_sec = sum(confirmed_time_sec),
median_confirmed_time_sec = median(confirmed_time_sec)) %>%
ungroup() %>%
select(-c(src_area, date, day, hour))
glimpse(int_var)
## Observations: 3,862
## Variables: 6
## $ mean_distance <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ sum_distance <dbl> 1.6709748, 16.8151311, 14.1027795, 3.633148…
## $ median_distance <dbl> 1.6709748, 8.4075656, 14.1027795, 3.6331489…
## $ mean_confirmed_time_sec <dbl> 0.00000, 22.00000, 0.00000, 0.00000, 0.0000…
## $ sum_confirmed_time_sec <dbl> 0, 44, 0, 0, 0, 12, 42, 0, 306, 302, 310, 5…
## $ median_confirmed_time_sec <dbl> 0.0, 22.0, 0.0, 0.0, 0.0, 0.0, 9.0, 0.0, 43…
from variable distance and confirmed_time_sec we do mean, median, sum for make new variable. we combine it with coverage_hourly
coverage_hourly <- cbind(coverage_hourly, int_var)
Cross-validation (CV) is a statistical method that can be used to evaluate the performance of models or algorithms where the data is separated into two subsets namely learning process data and validation / evaluation data. We already have train and test dataset, in this case i will use 3 dataset, is train, validation and test data.
We check proportion our existing train and test:
nrow(train)
## [1] 229532
nrow(test)
## [1] 504
total we have 229.532 data observation and our test data 504 target to predict. The proportion is already well, because we have more 99% data to train our model later.
This train dataset i will seperate into 2 dataset: data train and data validation. I will take 15% from data train to put into data validation.
set.seed(100)
intial_train <- initial_split(coverage_all_date, prop = 0.80, strata = "coverage")
train_training <- training(intial_train)
train_validation <- testing(intial_train)
prop.table(table(train_training$coverage))
##
## insufficient sufficient
## 0.5410468 0.4589532
train_training_upsample <- upSample(x = within(train_training, rm("coverage")), y = train_training$coverage, yname = "coverage")
prop.table(table(train_training_upsample$coverage))
##
## insufficient sufficient
## 0.5 0.5
Before we try to choose model, we have test dataset that give us rules that our model result should fit with our test structure. Lets take look data test we have
glimpse(test)
## Observations: 504
## Variables: 3
## $ src_area <chr> "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk3", "sxk…
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
summary(test)
## src_area datetime coverage
## Length:504 Min. :2017-12-03 00:00:00 Mode:logical
## Class :character 1st Qu.:2017-12-04 17:45:00 NA's:504
## Mode :character Median :2017-12-06 11:30:00
## Mean :2017-12-06 11:30:00
## 3rd Qu.:2017-12-08 05:15:00
## Max. :2017-12-09 23:00:00
Data test is contain 3 variable, this data test will be our prediction result, in this case we will try predict next 7 days (Sunday, December 3rd 2017 to Monday, December 9th 2017).
We have src_area it source area that user order driver. We can see datetime interval separate each 1 hour, it mean our prediction value will predict each hour or hourly for 7 days. We have column coverage variable, it will be our target variable contain sufficient and insufficient, based on our bussiness case we would to know insufficient as priority and it will positive.
Our variable to predict using data test strict only based on this 3 variable, for datetime we can extract it into hour interval.
test <- test %>%
mutate(src_area = as.factor(src_area),
date = date(datetime),
day = as.factor(wday(datetime, label = TRUE, abbr = FALSE)),
hour = as.factor(hour(datetime)))
glimpse(test)
## Observations: 504
## Variables: 6
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ datetime <dttm> 2017-12-03 00:00:00, 2017-12-03 01:00:00, 2017-12-03 02:00:…
## $ coverage <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ date <date> 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03, 2017-12-03,…
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour <fct> 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17…
Based on data test exploration we do above, the only predictor variable we can use only src_area, datetime,date,day,hour and for target still coverage with 2 class sufficient and insufficient. we can removing unsued variable from our data train and data validation we make before
train_training <- train_training %>%
select(-c("confirmed","nodrivers","total_transaction","datetime","date"))
train_validation <- train_validation %>%
select(-c("confirmed","nodrivers","total_transaction","datetime","date"))
train_training_upsample <- train_training_upsample %>%
select(-c("confirmed","nodrivers","total_transaction","datetime","date"))
For models to predict, we only have choice choose model that accept only factor, or categorical predictor variable and categorical target variable. I decided to compare 4 type model:
In this case, to compare each model we will use confusion matrix. Confusion matrix is a table that shows four different category: True Positive, True Negative, False Positive, and False Negative.
The performance will be the Accuracy, Sensitivity/Recall, Specificity, and Precision. Accuracy measures how many of our data is correctly predicted. Sensitivity measures out of all positive outcome, how many are correctly predicted. Specificty measure how many negative outcome is correctly predicted. Precision measures how many of our positive prediction is correct.
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
Logistic regression is a technique that is well suited for examining the relationship between a categorical response variable and one or more categorical or continuous predictor variables. we can check our structure of our dataset, we can see both of predictor or target is factor.
glimpse(train_training_upsample)
## Observations: 3,928
## Variables: 4
## $ src_area <fct> sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, sxk3, …
## $ day <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sund…
## $ hour <fct> 0, 3, 7, 8, 11, 12, 13, 14, 17, 19, 20, 21, 1, 7, 8, 10, 11,…
## $ coverage <fct> insufficient, insufficient, insufficient, insufficient, insu…
Logistict regression follow 3 assumtion before we use it:
Multicollinearity: between predictors do not correlate with each other, we can see our predictor contain categorical and each dont have corelate each other.
Independence of Observations: between observations are mutually independent & do not originate from repeated measurements. We can check ourtraining data actually extraction for hourly and it show historical and it didnt repeat, if repet for time and day it could have different area
Linearity of Predictor & Log of Odds: the way of interpretation refers to this assumption. for numerical variables, an increase of 1 value will increase the log of odds.
From total 3 assumption, 2 assumption we confidence our train dataset meet the requirment. So we can continue using our train dataset
We will train logistic regression model. at first we will use all our variabel inside train data frame, and use coverage as target.
logistict_model <- glm(coverage ~ ., train_training_upsample, family = "binomial")
summary(logistict_model)
##
## Call:
## glm(formula = coverage ~ ., family = "binomial", data = train_training_upsample)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.67425 -0.73517 0.00039 0.70405 2.69108
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.30543 0.19411 -1.573 0.115612
## src_areasxk8 2.45943 0.10611 23.179 < 2e-16 ***
## src_areasxk9 -1.14049 0.09777 -11.665 < 2e-16 ***
## day.L -0.26384 0.10837 -2.435 0.014912 *
## day.Q -0.30119 0.10796 -2.790 0.005274 **
## day.C 0.02223 0.10799 0.206 0.836872
## day^4 0.41647 0.10847 3.840 0.000123 ***
## day^5 0.16530 0.10786 1.533 0.125390
## day^6 -0.02934 0.10725 -0.274 0.784410
## hour1 0.10385 0.27392 0.379 0.704611
## hour2 0.68156 0.26648 2.558 0.010538 *
## hour3 0.68393 0.26233 2.607 0.009131 **
## hour4 1.44160 0.27778 5.190 2.11e-07 ***
## hour5 1.39752 0.27877 5.013 5.35e-07 ***
## hour6 0.99198 0.26675 3.719 0.000200 ***
## hour7 -0.64403 0.28296 -2.276 0.022843 *
## hour8 -2.09973 0.29788 -7.049 1.80e-12 ***
## hour9 -0.91742 0.27947 -3.283 0.001028 **
## hour10 0.11837 0.27001 0.438 0.661097
## hour11 0.09742 0.27242 0.358 0.720635
## hour12 -0.38025 0.28216 -1.348 0.177774
## hour13 -0.69846 0.27990 -2.495 0.012581 *
## hour14 -0.48210 0.28178 -1.711 0.087100 .
## hour15 -0.71407 0.28540 -2.502 0.012350 *
## hour16 -1.05535 0.28809 -3.663 0.000249 ***
## hour17 -1.12399 0.28381 -3.960 7.48e-05 ***
## hour18 -0.99653 0.28472 -3.500 0.000465 ***
## hour19 -1.10776 0.28312 -3.913 9.13e-05 ***
## hour20 -0.04379 0.27489 -0.159 0.873443
## hour21 0.09773 0.27038 0.361 0.717750
## hour22 0.69420 0.26831 2.587 0.009672 **
## hour23 0.24501 0.26759 0.916 0.359868
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5445.4 on 3927 degrees of freedom
## Residual deviance: 3711.9 on 3896 degrees of freedom
## AIC: 3775.9
##
## Number of Fisher Scoring iterations: 5
We have only 3 predictor variable, but why it show lot of variable? because glm function will automatically do dummy coding means is reoding the original categorical variable into a set of binary variables that have values of one and zero.
From summary we can analyze the fitting and interpret that the model is telling us. We can se average all of our predictor variable are statistivally significant. No matter some dummy coding variable have big p-value or more <0.05 we cant judge one by one we must sum it as one variable.
We have make our logistic regression using all predict variable on dataset. Next this model use to predict with valudation dataset.
logistict_prediction <- predict(logistict_model, train_validation, type = "response")
head(logistict_prediction)
## 2 3 5 17 23 30
## 0.4641338 0.6068290 0.7674631 0.2136755 0.6098418 0.7396008
Logistic regression return value within range of [0,1] and isnt binary class. We will convert probabilty into class using threshold value (by default we will put 0.5 as threshold value).
logistict_prediction <- as.factor(if_else(logistict_prediction >= 0.5, "sufficient", "insufficient"))
head(logistict_prediction)
## [1] insufficient sufficient sufficient insufficient sufficient
## [6] sufficient
## Levels: insufficient sufficient
As result we have categorical or class prediction, we use confusion matrix to evaluate the model
matrix <- confusionMatrix(logistict_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
logistict_matrix <- matrix_result(matrix, "Logistict Reg.")
logistict_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Logistict Reg. 79.9117 83.26531 75.96154 80.31496
These table explain:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
The result show our prediction on validation dataset using logistic regression model is 81.6% for accuracy, it mean that our result data prediction 81.6% is correctly classified. Precision/positive predicted value around 80.60%, mean our positive prediction is correctly classified. Value of sensitivity is 86.95% and specificity 75.32%, this indicate positive and negative predict correctly classified range around 75% - 86%.
Naive Bayes algorithm, in particular is a logic based technique which is simple yet so powerful that it is often known to outperform complex algorithms for very large datasets.Historically, this technique became popular with applications in email filtering, spam detection, and document categorization. There are certain characteristics of Naive Bayes that should be considered:
Although it is often outperformed by other techniques, and despite the naive design and oversimplified assumptions, this classifier can perform well in many complex real-world problems. And since it is a resource efficient algorithm that is fast and scales well, we will implement it to our case.
We need build our model, and in this case we also apply Laplace Estimator. At first naive bayes model, we will use all predictor variable we have:
naive_model <- naiveBayes(coverage ~ ., data = train_training_upsample, laplace = 1)
Next this model use to predict with valudation dataset, and using confussion matrix
naive_prediction <- predict(naive_model, train_validation)
matrix <- confusionMatrix(naive_prediction, train_validation$coverage, positive = "insufficient")
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
naive_matrix <- matrix_result(matrix, "Naive Bayes")
naive_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes 80.3532 85.5102 74.27885 79.65779
These table explain:
* Accuracy: the ability to correctly predict both classes from the total observation.
* Precision: the ability to correctly predict the positive class from the total predicted-positive class (false positive is low).
* Recall: the ability to correctly predict the positive class from the total actual-positive class (false negative is low).
* Specificity: the ability to correctly predict the negative class from the total actual-negative class.
There is way to improve our naive bayes, before we do it we will check ROC curve from our naive bayes:
naive_prediction_raw <- as.data.frame(predict(naive_model, train_validation, type = "raw"))
# ROC
naive_roc <- data.frame(prediction = naive_prediction_raw[,2],
trueclass = as.numeric(train_validation$coverage=="sufficient"))
naive_roc_pred <- prediction(naive_roc$prediction, naive_roc$trueclass)
# ROC curve
plot(performance(naive_roc_pred, "tpr", "fpr"),
main = "ROC")
abline(a = 0, b = 1)
The ROC is a curve that shows the performance of the classification model for all thresholds. ROC represents graphics from:
Recall / sensitivity (true positive rate) on they axis.1-specificity (false positive rate) on thex axis.Next we calculate AUC
# AUC
auc <- performance(naive_roc_pred, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.8481113
AUC is the area under the ROC curve. AUC values indicate the success of the model predicting / differentiating the two classes. The area value has an interval between 0 to 1. If the AUC value is close to 1, it means that the classification model is able to predict / distinguish the two classes well. However, if the AUC value is close to 0.5 it means that the classification model is not able to predict / distinguish the two classes well.
* AUC value close to 1: the classification model is able to predict / distinguish the two classes well
* AUC values close to 0.5: the classification model is not able to predict / distinguish the two classes well.
Based on result ROC and AUC, we get our ROC curce show good separation with AUC score 0.8436585 it mean we have chance to improve this Naive Bayes Model.
In our case we are focus “sufficient” class mean we want predcit when scotty success full fullfill all order from users, it means we are focus on Precision parameter which mean our Naive Bayes already give us good result.
Based on our AUC score we still can improve our model, and one of method to tuning model is change the threshold.
# model tuning - metrics function
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
# apply function metrics
for(i in 1:100){
result[i,] = metrics(cutoff = co[i],
prob = naive_prediction_raw$sufficient,
ref = as.factor(ifelse(train_validation$coverage=="sufficient", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Recall" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From plot above, we get information break even point is around 0.45. we will set it around 0.49 because we are focusing on Precision parameter, meanwhile we still want other paramaeter get good result.
#Tuning Threshold
naive_prediction_tuning <- naive_prediction_raw %>%
mutate(label = as.factor(ifelse(sufficient >= 0.49, "sufficient", "insufficient"))) %>%
select(label)
naive_matrix_tuning <- confusionMatrix(naive_prediction_tuning$label, train_validation$coverage, positive = "insufficient")
naive_matrix_tuning <- matrix_result(naive_matrix_tuning, "Naive Bayes Tuning")
naive_matrix_tuning
## Model Accuracy Sensitivity Specificity Precision
## 1 Naive Bayes Tuning 80.3532 85.10204 74.75962 79.88506
Decision tree model is one of the tree-based models which has the major benefit of being interpretable. Decision tree is an algorithm that will make a set of rules visualized in a diagram that resembles a tree. There are certain characteristics of decision tree model:
let build decision tree model
dtree_model <- ctree(formula = coverage ~., data = train_training_upsample)
plot(dtree_model)
dtree_prediction <- predict(dtree_model, train_validation)
matrix <- confusionMatrix(dtree_prediction, train_validation$coverage)
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
dtree_matrix <- matrix_result(matrix, "Decision Tree")
dtree_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Decision Tree 80.7947 87.34694 73.07692 79.25926
predict_result <- predict(dtree_model, test)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 5)
randomforest_model <- train(coverage ~ ., data = train_training_upsample, method = "rf", trControl = ctrl, ntree = 100)
randomforest_prediction <- predict(randomforest_model, train_validation)
matrix <- confusionMatrix(randomforest_prediction, train_validation$coverage)
table <- as.table(matrix)
table <- as.data.frame(table)
confussionMatrixPlot(table)
randomforest_matrix <- matrix_result(matrix, "Random Forest")
randomforest_matrix
## Model Accuracy Sensitivity Specificity Precision
## 1 Random Forest 79.9117 84.4898 74.51923 79.61538
tbl = matrix(data=c(51, 49, 24, 26), nrow=2, ncol=2, byrow=T)
dimnames(tbl) = list(City=c('B', 'T'), Gender=c('M', 'F'))
chi2 = chisq.test(tbl, correct=F)
c(chi2$statistic, chi2$p.value)
## X-squared
## 0.1200000 0.7290345
sqrt(chi2$statistic / sum(tbl))
## X-squared
## 0.02828427
test$coverage <- predict_result
test <- test %>%
mutate(coverage = as.factor(if_else(coverage == 1,"sufficient","insufficient"))) %>%
select(-c(date,day,hour))
write.csv(test, "data/data-test-result.csv")
mean_distance <- train_hourly %>%
group_by(src_area, date, day, hour) %>%
summarise(mean_distance = mean(distance)) %>%
ungroup() %>%
select(mean_distance)
coverage_hourly <- cbind(coverage_hourly, mean_distance)